
[release][train] Adding py3.13 ray-ml image with torchft-nightly#63587

Open
elliot-barn wants to merge 11 commits into master from elliot-barn-add-torchft-to-ml-release-image

Conversation

@elliot-barn

@elliot-barn elliot-barn commented May 21, 2026

Collaborator

Creates a ray-ml py3.13 release test image with torchft-nightly.

Adds a Python 3.13 variation of training_ingest_benchmark-task=image_classification for full_training.jpeg and full_training.s3_url.

Release test run: https://buildkite.com/ray-project/release/builds/93976

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn elliot-barn requested a review from a team as a code owner May 21, 2026 23:45
@elliot-barn elliot-barn requested a review from TimothySeah May 21, 2026 23:45
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces support for Python 3.13 across the build and release infrastructure, including updates to Buildkite configurations, dependency lock files, and BYOD requirements. It also adds a new suite of nightly training ingest benchmarks for Python 3.13. Feedback was provided regarding a potential typo in a configuration flag, redundant dependency declarations in the new requirements file, and inconsistent argument formatting in the release test definitions.

I am having trouble creating individual review comments.

release/release_tests.yaml (1967)

high

anyscale_sdk_2026: true appears to be a typo. This flag is typically anyscale_sdk_v2: true in Ray release tests. Please verify if this is the intended key.

    anyscale_sdk_v2: true

release/ray_release/byod/requirements_ml_byod_3.13.in (43-44)

medium

Both torchft==0.1.1 and torchft-nightly are listed. Since the pull request aims to include the nightly version, the stable version is redundant and may cause installation conflicts. It should be removed.

torchft-nightly
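
The overlap flagged above could be caught mechanically. The following is a minimal sketch, not part of this PR: it assumes that a `-nightly` suffix marks an alternate build of the same project, and the helper names `base_name` and `find_overlaps` are hypothetical. The `numpy` entry is a placeholder added only to show a non-conflicting line.

```python
import re


def base_name(requirement: str) -> str:
    """Return the normalized project name with any '-nightly' suffix removed."""
    # Keep only the project name that precedes a specifier, extra, or marker.
    name = re.split(r"[=<>!~\[;@\s]", requirement.strip(), maxsplit=1)[0]
    # PEP 503 style normalization: lowercase, runs of '-', '_', '.' become '-'.
    name = re.sub(r"[-_.]+", "-", name).lower()
    # Assumption: '<name>-nightly' is a nightly build of '<name>'.
    return name.removesuffix("-nightly")


def find_overlaps(lines):
    """Return {base name: [requirement lines]} for names listed more than once."""
    groups = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        groups.setdefault(base_name(line), []).append(line)
    return {name: reqs for name, reqs in groups.items() if len(reqs) > 1}


requirements = ["torchft==0.1.1", "torchft-nightly", "numpy"]
print(find_overlaps(requirements))
# {'torchft': ['torchft==0.1.1', 'torchft-nightly']}
```

The suffix rule is only a heuristic; two distributions can also collide on the same import package without sharing a name prefix.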

release/release_tests.yaml (2038)

medium

The arguments --skip_train_step True and --skip_validation_at_epoch_end True use a space-separated format, which is inconsistent with the --arg=value format used in all other variations of this test (e.g., lines 1989, 2058). Using the consistent format improves maintainability and avoids potential parsing issues.

        script: RAY_TRAIN_V2_ENABLED=1 python train_benchmark.py --task=image_classification --dataloader_type=ray_data --num_workers=16 --skip_train_step=True --skip_validation_at_epoch_end=True --image_classification_data_format=s3_url
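
Whether the two formats actually parse differently depends on how train_benchmark.py reads its arguments, which this thread does not show. As a rough illustration only, assuming an `argparse`-style parser with a hypothetical `str2bool` converter, both spellings yield the same result, so the suggestion here is mainly about consistency:

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert common textual booleans; reject anything else."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the benchmark's real argument parser.
    parser = argparse.ArgumentParser()
    parser.add_argument("--skip_train_step", type=str2bool, default=False)
    parser.add_argument("--skip_validation_at_epoch_end", type=str2bool, default=False)
    return parser


parser = build_parser()
spaced = parser.parse_args(
    ["--skip_train_step", "True", "--skip_validation_at_epoch_end", "True"]
)
with_equals = parser.parse_args(
    ["--skip_train_step=True", "--skip_validation_at_epoch_end=True"]
)
print(spaced == with_equals)  # True
```

Other argument parsers may treat a bare `--flag value` pair differently, which is one reason to keep a single format across all variations.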

@ray-gardener ray-gardener Bot added train Ray Train Related Issue release-test release test labels May 22, 2026
@github-actions

github-actions Bot commented Jun 5, 2026


This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 5, 2026
elliot-barn and others added 6 commits June 9, 2026 15:06
Add a self-contained raydepsets depset (release_ml_torchft_tests.depsets.yaml)
that compiles the Ray ML release-test dependencies with torchft-nightly layered
on top, producing release/ray_release/byod/ml_torchft_py3.13.lock for
py3.13 / cu128. It is installed onto the core Ray CUDA image via
byod_ml_torchft.sh, so torchft release tests no longer depend on the published
py3.13 ray-ml image (which fails to build due to dask/nixl py3.13 gaps).

Decouple from the in-progress published py3.13 ray-ml image work by reverting
the buildkite image/release steps, ray-images.json, the gpu BYOD py3.13
allowance, and the ml-base-extra-testdeps py3.13 depset + locks, and by removing
torchft from the shared requirements_ml_byod_*.in files. torchft now lives only
in the dedicated requirements_ml_torchft.in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…ft.txt

Align the py3.13 torchft release-image depset with master after the torch 2.9.0
upgrade (#63361):

- Bump requirements_ml_byod_3.13.in to torch==2.9.0 and drop the stale
  triton==3.3.0 pin (torch 2.9.0 pulls triton==3.5.0 transitively), matching the
  py3.13 constraint and ML requirement files.
- Source torchft from the canonical python/requirements/ml/py313/torchft.txt
  (torchft-nightly==2026.5.15, torch-2.9.0-compatible) instead of a separate
  requirements_ml_torchft.in, so there is a single torchft pin.
- Regenerate ml_torchft_py3.13.lock -> torch==2.9.0+cu128 / torchaudio 2.11.0+cu128
  / triton 3.5.0; verified idempotent so raydepsets --check passes.
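
The pins listed above can be spot-checked against a regenerated lock. This is a minimal sketch that assumes the lock uses pip-compile-style `name==version` lines with indented `--hash` continuations; the actual layout of ml_torchft_py3.13.lock is not shown in this PR, and the sample text and hash values below are placeholders.

```python
def parse_pins(lock_text: str) -> dict:
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for raw in lock_text.splitlines():
        # Drop the trailing line-continuation backslash, if any.
        line = raw.strip().rstrip("\\").strip()
        # Skip blanks, comments, and option lines such as --hash or --index-url.
        if not line or line.startswith(("#", "-")):
            continue
        line = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


# Placeholder lock excerpt; versions come from the commit message above.
SAMPLE_LOCK = """\
# illustrative excerpt, not the real lock file
torch==2.9.0+cu128 \\
    --hash=sha256:0000
triton==3.5.0 \\
    --hash=sha256:0000
"""

pins = parse_pins(SAMPLE_LOCK)
print(pins["torch"], pins["triton"])  # 2.9.0+cu128 3.5.0
```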

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Add a minimal reference release test showing how to run a release test on the
torchft Ray ML image variant. It uses the core Ray CUDA image (py3.13) with the
torchft dependency lock installed on top:

  cluster:
    anyscale_sdk_2026: true
    byod:
      type: cu123
      post_build_script: byod_ml_torchft.sh
      python_depset: ml_torchft_py3.13.lock

The workload imports torch (2.9.0) + torchft and runs a short Ray Train v2 +
torchft linear training loop to prove the image works end to end. Validated
against the release schema (//release:test_config).
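
For a quicker check than the full training loop, an import probe is often enough to tell whether the image has the expected packages. This is a minimal sketch and not the reference workload from this commit; `missing_modules` is a hypothetical helper, and it only confirms that the modules can be located, not that they work together.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # On the torchft image described above, this list is expected to be empty.
    missing = missing_modules(["torch", "torchft"])
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("torch and torchft are importable")
```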

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Setting byod.python_depset is sufficient: the BYOD image build automatically
copies the lock in and runs `uv pip install --system --no-deps -r
python_depset.lock` (release/ray_release/byod/build_context.py). The custom
byod_ml_torchft.sh ran the identical command, so it installed the deps a second
time for no reason.
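
For reference, the install step described above can be reproduced as an argument list. This is only a sketch of the command quoted in this message, not the code in build_context.py; `depset_install_command` is a hypothetical helper name.

```python
import shlex


def depset_install_command(lock_path: str) -> list:
    """Build the uv command quoted in the commit message for a given lock file."""
    return ["uv", "pip", "install", "--system", "--no-deps", "-r", lock_path]


command = depset_install_command("ml_torchft_py3.13.lock")
print(shlex.join(command))
# uv pip install --system --no-deps -r ml_torchft_py3.13.lock
```

Because `--no-deps` installs exactly what the lock lists, running the same command a second time from a post-build script adds nothing to the image.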

Remove byod_ml_torchft.sh and the post_build_script reference from the
torchft_hello_world reference test; rely on python_depset alone. Validated with
//release:test_config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@github-actions github-actions Bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Jun 10, 2026
elliot-barn and others added 2 commits June 10, 2026 03:38
The py3.13 ML stack pinned sentencepiece==0.1.96 via
python/requirements/ml/py313/train-test-requirements.txt, which flows into
requirements_compiled_py3.13.txt. sentencepiece 0.1.96 ships no cp313 wheel, so
`uv pip install --no-deps` falls back to building it from sdist, which needs
cmake (absent in the BYOD base image) and fails. This only surfaced on the
py3.13 torchft image; the py3.10/3.12 ML images have a cp310/cp312 wheel.

Bump to sentencepiece==0.2.1 (first release with cp313 manylinux wheels),
regenerate requirements_compiled_py3.13.txt (only the sentencepiece pin
changes), and regenerate ml_torchft_py3.13.lock. No other py3.13 deplock pins
sentencepiece, so the change is contained.
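
The wheel gap described above can be detected from distribution filenames alone. The sketch below assumes the standard wheel naming scheme (`name-version[-build]-python-abi-platform.whl`). The filenames are illustrative and were not copied from the package index, and the check deliberately ignores `abi3` and pure-Python wheels, which can also install on py3.13.

```python
def wheel_python_tags(filename: str) -> list:
    """Return the python tags encoded in a wheel filename, e.g. ['cp313']."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform gives 5 or 6 components.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    return parts[-3].split(".")


def has_cp313_wheel(filenames) -> bool:
    """True if any listed file is a wheel tagged for CPython 3.13."""
    return any(
        name.endswith(".whl") and "cp313" in wheel_python_tags(name)
        for name in filenames
    )


# Illustrative file listings for the two releases discussed above.
old_release = [
    "sentencepiece-0.1.96-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "sentencepiece-0.1.96.tar.gz",
]
new_release = [
    "sentencepiece-0.2.1-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl",
]

print(has_cp313_wheel(old_release))  # False
print(has_cp313_wheel(new_release))  # True
```

When the check returns False, the installer has only the sdist to fall back on, which is the build-from-source path that fails without cmake.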

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@elliot-barn elliot-barn added the go add ONLY when ready to merge, run all tests label Jun 10, 2026

Labels

go add ONLY when ready to merge, run all tests release-test release test train Ray Train Related Issue unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.


1 participant